A primary interest in any method for producing synthetic speech is to
minimize the number of bits per second required to generate acceptable
quality speech. An efficient method for transmitting the linear-prediction
parameters has been found by using the techniques of differential PCM.
Using this technique, speech transmission is achieved employing fewer
than 1500 bits/s. Further reductions in the linear-prediction storage
requirements can be realized at a cost of higher system complexity by
transmission of the most significant eigenvectors of the parameters. This
technique in combination with differential PCM can lower the storage to
1000 bits/s.

I. INTRODUCTION 

The method of linear prediction has proved quite popular and
successful for use in speech compression systems. 1-4 In this method, speech
is modeled as the output of an all-pole filter H(z) that is excited by a
sequence of pulses separated by the pitch period for voiced sounds, or
pseudo-random noise for unvoiced sounds. These assumptions imply
that within a frame of speech the output speech sequence is given by

$$s(n) = \sum_{k=1}^{p} a_k s(n-k) + u_n,$$

where p is the number of modeled poles, $u_n$ is the appropriate input
excitation, and the $a_k$'s are the coefficients characterizing the filter
(linear prediction coefficients). Figure 1 illustrates the frequency-domain,
as well as the equivalent time-domain, model of linear-prediction
speech production. To account for the nonstationary character
of the speech waveform, the parameters $a_k$ of the modeled filter are
periodically updated during successive speech frames.* Generation of
speech in this method requires a knowledge of the pitch, the filter
parameters, and the gain of the filter (amplitude of input excitation)
in each speech frame.

* A frame is a segment of speech thought adequate to assume stationarity of the
speech process. Typical frame lengths employed range from 10 to 30 ms.

Fig. 1 - Discrete model of speech production as employed in linear prediction.
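
The production model above can be sketched in a few lines of Python. This is only an illustration of the difference equation for s(n): the two-pole coefficient set, the 40-sample pitch period, and the helper names (synthesize_frame, pulse_train, noise) are assumptions chosen for the example, not values from the paper.

```python
import random

def synthesize_frame(a, excitation, history):
    """All-pole synthesis: s(n) = sum_k a[k-1] * s(n-k) + u_n.

    `history` holds at least len(a) past output samples, newest last.
    """
    p = len(a)
    past = list(history[-p:])
    out = []
    for u in excitation:
        # past[-1 - k] is s(n - (k + 1)), which pairs with coefficient a_{k+1}
        s = u + sum(a[k] * past[-1 - k] for k in range(p))
        out.append(s)
        past = past[1:] + [s]
    return out

def pulse_train(n_samples, pitch_period, gain=1.0):
    """Voiced excitation: one pulse every `pitch_period` samples."""
    return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def noise(n_samples, gain=1.0, seed=0):
    """Unvoiced excitation: pseudo-random noise."""
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

# Hypothetical stable two-pole filter (pole radius sqrt(0.8) < 1).
a = [1.3, -0.8]
voiced = synthesize_frame(a, pulse_train(80, 40), history=[0.0, 0.0])
unvoiced = synthesize_frame(a, noise(80, 0.1), history=[0.0, 0.0])
```

In a complete synthesizer the coefficient list, the pitch period, and the gain would be replaced frame by frame, which is exactly the parameter set whose transmission cost the rest of the paper addresses.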

A primary interest in any method for producing synthetic speech
is to minimize the number of bits per second needed to generate
acceptable quality speech. The smaller the information storage
requirements (bits per second), the more attractive the system becomes for
the important applications of voice answer-back and speech
transmission. 6 To achieve the minimum storage requirement for a given
system, an efficient means of quantizing the generating parameters
must be determined. Using conventional pulse code modulation (pcm)
techniques in which the amplitude of each parameter is uniformly
quantized into $2^B$ levels, it has been found necessary to allot at least
five bits (B = 5) of information for both pitch and gain and at least
11 bits for each $a_k$. 1 The corresponding storage requirements for this
method of quantization of the linear-prediction (lpc) parameters are
unacceptable for many applications, and an improved scheme for
quantizing the parameters is needed.
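
The pcm storage requirement implied by these figures can be worked out directly. The short sketch below assumes the 12-pole model and the analysis rate of 50 frames per second that the paper quotes in later sections; the function name is hypothetical.

```python
FRAMES_PER_SECOND = 50  # analysis rate quoted later in the paper

def pcm_bits_per_frame(poles=12, coeff_bits=11, pitch_bits=5, gain_bits=5):
    """Bits per frame when every parameter is uniformly quantized (pcm)."""
    return pitch_bits + gain_bits + poles * coeff_bits

coefficient_bits = 12 * 11                      # filter coefficients alone
frame_bits = pcm_bits_per_frame()               # pitch + gain + coefficients
pcm_rate = frame_bits * FRAMES_PER_SECOND       # bits per second
print(coefficient_bits, frame_bits, pcm_rate)   # 132 142 7100
```

The filter coefficients account for 132 of the 142 bits in each frame, which is why the remainder of the paper concentrates on coding them more efficiently.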

For the usual 12-pole linear-prediction representation, the dominant
portion of storage is allotted to the filter coefficients (132 bits per
frame in the pcm method of information transmission). The extremely
fine quantization of the $a_k$'s is necessary because small perturbations
in the coefficients can sometimes cause radical changes in the important
pole frequencies of the modeled filter H(z) and may even cause the
filter to become unstable (poles outside the unit circle). To overcome
the limitations of quantizing the predictor (filter) coefficients, it has
been found quite profitable to transmit different but informationally
equivalent sets of parameters. 4 - 6 The most suitable parameters have
been experimentally determined to be the log-area ratio coefficients
$g_i$. These coefficients are nonlinearly related to the filter coefficients by

$$g_i = \log\frac{1 + k_i}{1 - k_i}, \tag{1}$$

where the $k_i$'s are termed the parcor coefficients. 1 ' 2 - 4 - 6 If we denote
$a_i^{(j)}$ as the ith linear prediction coefficient for a jth-pole
linear-prediction model, then

$$k_i = a_i^{(i)}. \tag{2}$$

The parcor coefficients have the very important property that if

$$|k_i| < 1, \quad i = 1, \ldots, p, \tag{3}$$

then it is guaranteed that the linear-prediction filter is stable. 4 Thus,
small perturbations in the parcor coefficients or log-area coefficients
will not affect the stability of the modeled filter.
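
The stability argument can be illustrated with a small sketch of the mapping in eq. (1) and its inverse. The uniform quantizer, the step of 0.25, and the sample parcor values are hypothetical choices for the example; the inverse k = tanh(g/2) follows algebraically from eq. (1).

```python
import math

def parcor_to_log_area(k):
    """Eq. (1): g = log((1 + k) / (1 - k)), defined for |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def log_area_to_parcor(g):
    """Inverse of eq. (1): k = (e^g - 1) / (e^g + 1) = tanh(g / 2)."""
    return math.tanh(g / 2.0)

def quantize_uniform(value, step):
    """Round to the nearest multiple of `step` (illustrative quantizer)."""
    return step * round(value / step)

parcors = [0.9, -0.5, 0.2]                       # hypothetical values
log_areas = [parcor_to_log_area(k) for k in parcors]
coarse = [quantize_uniform(g, 0.25) for g in log_areas]
recovered = [log_area_to_parcor(g) for g in coarse]
# Every recovered parcor coefficient still satisfies condition (3).
print(all(abs(k) < 1.0 for k in recovered))
```

Because tanh maps every finite log-area value into the open interval (-1, 1), quantization error in $g_i$ perturbs the pole positions but cannot move a parcor coefficient outside the stable range.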

Since the log-area coefficients are nonlinearly related to the filter 
coefficients, transmission of the log-area parameters is equivalent to a 
nonuniform quantization of the linear-prediction coefficients. By 
transmitting the log-area parameters, the number of bits allotted to 
the filter parameters can be effectively reduced by nearly £. 3 - 4 - 6 In this 
paper, we offer two additional methods of quantization of the necessary 
synthesis parameters (pitch, gain, and filter coefficients) that can even 
further reduce the storage requirements of a linear-prediction vocoder. 
One proposed method of quantization uses the technique of differ- 
ential pcm (dpcm) to transmit the linear-prediction parameters. This 
scheme exploits the fact that the lpc parameters are themselves pre- 
dictable from previous samples. Using this method, speech transmis- 
sion that is practically equivalent to the unquantized synthesis can 
be achieved using fewer than 2000 bits/s. 

The second method of transmission exploits the redundancy between 
the linear-prediction parameters. The lpc parameters can be predicted 
not only from the given parameter's past values, but also in some sense 
from values of the other parameters. The suggested scheme involves 
the transmission (using dpcm techniques) of the most significant eigen- 
vectors of the log-area parameters. For the typical speech utterance, 
the space of the 12 log-area coefficients can be effectively represented 
by only the first five or six eigenvectors. The transmission requirement 
for this method is fewer than 1200 bits/s. 

The organization of this paper is as follows. In Section II, we briefly
discuss the concept of dpcm coding. In Section III, we show that dpcm
coding offers a significant improvement over pcm coding for
transmission of the linear-prediction parameters. In Section IV, the results
of a synthetic speech experiment using the proposed dpcm scheme are
presented. In Section V, we discuss several methods of dpcm coding
that are more suitable for real-time implementation. Included in this
section is a discussion of adaptive quantization (adpcm) and adaptive
dpcm prediction. In Section VI, we discuss the method of orthogonal
linear prediction. The results of synthetic speech experiments are
included in this section. Finally, we conclude with a summary and
discussion of the results presented in the paper.

II. DIFFERENTIAL PULSE CODE MODULATION 

The idea of differential pcm is similar in philosophy to the concept
employed in linear-prediction speech analysis. In dpcm, we assume that
the transmitted parameter in a given frame of interest can be estimated
by a linear combination of the parameter in previous frames. 7 If we
let $x_r$ denote the value of the transmission parameter x in the rth
frame (where x can represent pitch, gain, log-area coefficients, or
whatever), then this assumption implies

$$x_r \approx \hat{x}_r = \sum_{i=1}^{n} b_i x_{r-i}, \tag{4}$$

where n is the order of the dpcm prediction analysis. The dpcm
technique calls for the transmission of the difference between the predicted
value $\hat{x}_r$ and the true value $x_r$.

Figure 2 illustrates the structure of the dpcm coding system. In the
implementation of a dpcm scheme, a feedback path around the
quantizer is used to ensure that the error in the reconstructed (quantized)
signal $\tilde{x}_r$ is precisely the quantization error for the difference signal
$e_r = x_r - \hat{x}_r$, where $\hat{x}_r$ is the predicted value based upon the quantized
signal $\tilde{x}_r$. The predictor coefficients $b_i$ are chosen to minimize the power
of the difference signal $e_r$. The mathematical analysis required for the
solution of the optimum set of $b_i$'s is exactly the same as the analysis
for the calculation of the linear-prediction coefficients $a_i$, $i = 1, \ldots, p$.
The determination of the $b_i$'s is made by solving the familiar
correlation equations:

$$\sum_{i=1}^{n} b_i \sum_{r=n}^{N} x_{r-i} x_{r-k} = \sum_{r=n}^{N} x_r x_{r-k}, \quad 1 \le k \le n, \tag{5}$$

where N is the number of frames in the utterance.
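
A minimal sketch of how the correlation equations (5) might be solved for a parameter track is given below. It assumes frames are indexed from zero, uses a small Gaussian-elimination routine in place of any particular library solver, and all function names are hypothetical.

```python
def correlation(x, i, k, n):
    """sum over r = n .. N of x[r-i] * x[r-k], with frames indexed from 0."""
    return sum(x[r - i] * x[r - k] for r in range(n, len(x)))

def solve(A, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for j in range(c, m + 1):
                M[r][j] -= f * M[c][j]
    b = [0.0] * m
    for c in range(m - 1, -1, -1):
        tail = sum(M[c][j] * b[j] for j in range(c + 1, m))
        b[c] = (M[c][m] - tail) / M[c][c]
    return b

def optimum_predictors(x, n):
    """Solve eq. (5) for the n dpcm predictor coefficients b_1 .. b_n."""
    A = [[correlation(x, i, k, n) for i in range(1, n + 1)]
         for k in range(1, n + 1)]
    y = [correlation(x, 0, k, n) for k in range(1, n + 1)]
    return solve(A, y)

# A track that obeys x_r = 1.5 x_{r-1} - 0.7 x_{r-2} exactly.
track = [1.0, 1.0]
for _ in range(28):
    track.append(1.5 * track[-1] - 0.7 * track[-2])
print(optimum_predictors(track, 2))   # close to [1.5, -0.7]
```

The routine assumes the correlation matrix is nonsingular, which fails for a constant-zero track; a practical implementation would need to guard against that case.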

The advantage of dpcm coding is obvious when one realizes that,
if $x_r$ can be accurately estimated from previous samples, the
information necessary for transmission (as expressed by the difference signal
$x_r - \hat{x}_r$) is necessarily less than the information required for coding
the signal without exploiting its predictability. The advantage of
1696 THE BELL SYSTEM TECHNICAL JOURNAL, DECEMBER 1975 



dpcm coding can be precisely specified by noting that, for a given
fineness of quantization, the quantization error is proportional to the
variance of the signal present at the quantizer. 7 Thus, the
improvement in performance (as measured by the frequently used standard
of signal-to-quantization-error ratio) using dpcm strategy over straight
pcm coding is given by the ratio of the variance (power) of $x_r$ to that
of the difference signal

$$G = \frac{\langle x_r^2 \rangle}{\langle (x_r - \hat{x}_r)^2 \rangle}. \tag{6}$$

Using the optimum predictors $b_i$, the resulting gain over pcm is
approximately* given by

$$G_{opt} \approx \frac{C_0}{C_0 - \sum_{i=1}^{n} b_i C_i} = \frac{\langle x_r^2 \rangle}{\langle (x_r - \hat{x}_r)^2 \rangle}, \tag{7}$$

where

$$C_i = \sum_{r=n}^{N} x_r x_{r-i}. \tag{8}$$

Fig. 2 - Differential pcm (qz = quantizer; pr = predictor; $\hat{x}_r = \sum_{j=1}^{n} b_j \tilde{x}_{r-j}$).



For equal standards of synthetic speech quality, the gain obtained 
by using a dpcm strategy over that of pcm coding can be traded off to 
reduce the rate of information transmission. Of course, for G < 1, it 
is disadvantageous to use dpcm coding. However, for the transmission 
of parameters that are reasonably smooth in their variation from one 
transmission frame to the next, it is guaranteed that dpcm coding is 
superior to pcm coding. In the next section, we demonstrate the effi- 
ciency of dpcm techniques for the coding of the linear-prediction speech 
parameters. 
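
As a rough illustration of eqs. (6) through (8), the sketch below computes the first-order improvement factor for a smooth, synthetic parameter track, once from the prediction error directly and once from the closed form $C_0 / (C_0 - b_1 C_1)$. The sinusoidal test track and the function name are assumptions made for the example, and the quantizer is ignored, as in the paper's own approximation.

```python
import math

def first_order_gain_db(x):
    """Return (b1, G from the error power, G from eq. (7)), gains in dB."""
    rng = range(1, len(x))                       # r = n .. N with n = 1
    c0 = sum(x[r] * x[r] for r in rng)           # eq. (8), i = 0
    c1 = sum(x[r] * x[r - 1] for r in rng)       # eq. (8), i = 1
    phi11 = sum(x[r - 1] * x[r - 1] for r in rng)
    b1 = c1 / phi11                              # eq. (5) with n = 1
    err = sum((x[r] - b1 * x[r - 1]) ** 2 for r in rng)
    g_direct = c0 / err                          # eq. (6)
    g_closed = c0 / (c0 - b1 * c1)               # eq. (7)
    return b1, 10.0 * math.log10(g_direct), 10.0 * math.log10(g_closed)

# Hypothetical slowly varying track, standing in for a smooth parameter.
smooth = [2.0 + math.sin(0.05 * r) for r in range(200)]
b1, gain_direct_db, gain_closed_db = first_order_gain_db(smooth)
print(round(b1, 3), round(gain_direct_db, 1), round(gain_closed_db, 1))
```

The two gain figures agree because, when $b_1$ satisfies eq. (5), the residual energy reduces algebraically to $C_0 - b_1 C_1$. A track that changes abruptly from frame to frame would give a much smaller gain.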

III. DPCM IMPROVEMENT IN CODING LPC PARAMETERS

To illustrate the efficiency of dpcm techniques in the coding of the
synthesis parameters, Fig. 3 shows the improvement factor $G_{opt}$ in
decibels as a function of the number of dpcm predictors. The figure


* The gain is approximate because the effects of the quantizer in Fig. 2 are ignored. 
Fig. 3 - $G_{opt}$ for the sentence, "May we all learn a yellow lion roar."

shows $G_{opt}$ for the first two log-area coefficients ($g_1$ and $g_2$),* pitch
period and power† for the all-voiced utterance, "May we all learn a
yellow lion roar." The improvement factor was calculated by
considering each particular parameter across the entire sentence and then
calculating the optimum predictors using eq. (5) and $G_{opt}$ using eq.
(7). The results depicted in Fig. 3 are for a male speaker, but the results
are typical of those obtained for other male and female speakers. For
the complete ensemble of parameters necessary to produce synthetic
speech (12 log-area coefficients, pitch, and power),‡ the improvement
factors were all significantly greater than 1.

Figure 4 shows a typical plot of the improvement factor calculated
for a sentence containing unvoiced sounds, "Few thieves are never
sent to the jug." In this sentence, the dpcm improvement over pcm
coding is not as dramatic as for the all-voiced sentence. The reason
for the decreased values of $G_{opt}$ is that the synthesis parameters tend
to change sharply during the unvoiced-voiced transition. Thus, during
the transition region there is an abrupt reduction in the correlation
between successive samples, and very little information can be gained
about the signal from past values. Another reason for the reduced
values of $G_{opt}$ is that the variation of the lpc parameters during
unvoiced sounds is inherently more random than during voiced sounds
and is thus less predictable. Fortunately, during unvoiced regions the
quality of the synthesized speech is more tolerant to quantization
noise than during voiced regions. 4 Thus, the diminished values of the
$G_{opt}$'s are not as significant as might at first appear.

* The parameters were calculated at the rate of 50 samples per second. The filter
parameters were calculated by the covariance method (Ref. 1), and pitch was measured
by a method developed by B. S. Atal (Ref. 8).

† Power is defined as the energy in the speech frame. For the synthetic system
employed, it is more convenient to transmit power instead of the amplitude of the
input excitation.

‡ Log-area coefficients were transmitted because of their optimum quantization
properties.

Fig. 4 - $G_{opt}$ for the sentence, "Few thieves are never sent to the jug."

Fig. 5 - $G_{opt}$ for the sentence, "Few thieves are never sent to the jug." A separate
dpcm analysis is used in each unvoiced and voiced segment.

One method of increasing the improvement factor for utterances
containing unvoiced sounds is to update the dpcm predictors
whenever the spectral properties of the speech signal change from unvoiced
to voiced sounds. Figure 5 shows $G_{opt}$ for the same sentence as was
used to obtain the results of Fig. 4, but in this figure the optimum
dpcm predictors were separately calculated for each different section
of unvoiced and voiced speech. The improvement factor for this form
of dpcm coding is about 5 dB better than a single calculation of the
predictors. In a later section of the paper, we discuss another method
for updating or adapting the dpcm predictors to the changing spectral
properties of the speech signal.

IV. SPEECH SYNTHESIS 
4.1 Synthesizer 

The improvement factors for the lpc parameters demonstrate that
dpcm coding is superior to pcm coding. To confirm the results of the
$G_{opt}$ experiments, a synthetic speech system was constructed in the
manner illustrated in Fig. 2. To take advantage of the fact that the
improvement factor saturates near n = 1 (Figs. 3 and 5), only a simple
first-order dpcm system was used. The optimum predictor was
recomputed for each separate unvoiced and voiced region and the lpc
parameters were calculated at a rate of 50 samples per second. The
speech was synthesized using the formulation discussed by Atal and
Hanauer. 1 After quantization, the parameters were geometrically
interpolated (linear interpolation on a logarithmic scale) to allow
pitch synchronous resetting of the synthesizer.
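
Geometric interpolation, as described above, amounts to interpolating linearly in the logarithm of the parameter. The sketch below shows the idea for a strictly positive parameter such as pitch period or power; how the paper's synthesizer handled parameters that can be negative is not stated, so the positivity restriction and the function name are assumptions of this example.

```python
import math

def geometric_interpolate(v0, v1, t):
    """Linear interpolation on a logarithmic scale.

    v0 and v1 must be positive; t runs from 0 (at v0) to 1 (at v1).
    """
    return math.exp((1.0 - t) * math.log(v0) + t * math.log(v1))

# Halfway between frame values of 100 and 400 lies their geometric mean.
midpoint = geometric_interpolate(100.0, 400.0, 0.5)
print(round(midpoint, 6))   # 200.0
```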

The quantizer used in the dpcm coding of the synthesis parameters
was a nonuniform quantizer that was designed to exploit the properties
of each parameter's error signal. An experimental investigation has
indicated that the difference signals for pitch, power, and $g_1$ are most
suitably modeled by a zero mean gamma density,

$$p(e_r) = \frac{\sqrt{k}}{2\sqrt{\pi |e_r|}} \exp(-k|e_r|), \tag{9}$$

where

$$k = \sqrt{0.75}/\sigma$$

and $\sigma$ is the standard deviation of the difference signal.
The higher-order log-area coefficients are more Laplacian in character:

$$p(e_r) = \frac{\alpha}{2} \exp(-\alpha|e_r|),$$

where

$$\alpha = \sqrt{2}/\sigma.$$

A signal with a gamma distribution is highly concentrated near its
mean, but can also readily achieve values more than three standard
deviations from its mean. A Laplacian signal is less concentrated than
a gamma signal near its mean value. Figure 6 illustrates the statistical
characteristics of a zero mean, unit standard deviation signal with a
gamma density, a Laplacian density, and a gaussian density. Figure
7 shows a comparison between the calculated distributions for the
difference signal of several typical synthesis parameters and their
approximated distributions.

For a gamma-behaved signal, the properties of the optimum
quantizer are summarized in Table I. 9 The $x_i$ values in the table define the
ends of quantizer input ranges, and the $y_i$ values are the corresponding
outputs. Thus, for a two-bit quantizer, an input between 0 and 1.205 is
quantized as 0.302. Similarly, an input between 0.229 and 0.588 for a
four-bit scheme is quantized as 0.386. The properties of the optimum
quantizer for Laplacian signals are summarized in Table II. 9 Included
in these tables is the expected mean square error between the difference
signal and the quantized difference. Thus, for a four-bit quantization
of a gamma signal, the mean square error is 0.0196.

Tables I and II are constructed for signals with unit standard
deviation. To obtain the levels $y_i$ and boundaries $x_i$ for signals with standard
deviation different from unity, simply multiply the given values by
the actual standard deviation.* The standard deviation for each
parameter can be approximated as the rms power of the unquantized error
signal. The rms value of the unquantized error signal is obtained
directly from the calculation of the optimum dpcm predictors and is
given by

$$\sigma^2 = C_0 - \sum_{i=1}^{n} b_i C_i.$$

Fig. 6 - Comparison of a gaussian, gamma, and Laplacian density with zero mean
and unit standard deviation.

Fig. 7 - Comparison between calculated density and approximated density for
difference signals.

Table I - Optimum quantizers for signals with gamma density
($\mu = 0$, $\sigma^2 = 1$)

Table II - Optimum quantizers for signals with Laplace density

4.2 Experimental results 

Four sentences were synthesized in the experimentation : 

A. Few thieves are never sent to the jug. 

B. May we all learn a yellow lion roar. 

C. It's time we rounded up that herd of Asian cattle. 

D. Should we chase those young outlaw cowboys? 

High-quality recordings of these sentences were made by two male and 
two female speakers, and these utterances were used to obtain the 
analysis data for the dpcm coding method. 



* To obtain the mean square error, multiply the values by the signal variance. 

' Since the properties of the unquantized error signal are explicitly known, it is 
sometimes advantageous to use a more complex nonuniform quantizer to truly 
optimize the transmission system. 
Various schemes were tested for assigning bit rates for each
individual error signal. From informal listening experiments, it was
determined that synthetic speech that was negligibly different from the
unquantized synthesis could be generated according to the following
bit assignment:

Pitch           : 3 bits/frame
Power           : 3 bits/frame
Unvoiced-voiced : 1 bit/frame
g_1             : 4 bits/frame
g_2             : 4 bits/frame
g_3             : 4 bits/frame
g_4             : 4 bits/frame
g_5             : 3 bits/frame
g_6             : 3 bits/frame
g_7             : 2 bits/frame
g_8             : 2 bits/frame
g_9             : 2 bits/frame
g_10            : 1 bit/frame
g_11            : 1 bit/frame
g_12            : 1 bit/frame

The total number of bits dedicated to the complete set of lpc
parameters is only 38 bits/frame or 1900 bits/s. On the average, an additional
100 bits/s are required to transmit the necessary dpcm information
(dpcm predictors, standard deviations, and initial values of the lpc
parameters). As can be observed from Figs. 8, 9, and 10, the
spectrogram of the dpcm synthetic speech closely resembles that of the
unquantized synthetic speech but requires only a fraction of the storage.
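
The totals quoted above follow directly from the bit assignment and the rate of 50 frames per second. The short check below simply repeats that arithmetic; the dictionary keys are labels chosen for the example.

```python
FRAMES_PER_SECOND = 50   # parameter rate used in the experiments
SIDE_INFO_BITS_PER_SECOND = 100   # average dpcm side information quoted above

bit_assignment = {
    "pitch": 3, "power": 3, "unvoiced_voiced": 1,
    "g1": 4, "g2": 4, "g3": 4, "g4": 4,
    "g5": 3, "g6": 3,
    "g7": 2, "g8": 2, "g9": 2,
    "g10": 1, "g11": 1, "g12": 1,
}

bits_per_frame = sum(bit_assignment.values())
parameter_rate = bits_per_frame * FRAMES_PER_SECOND
total_rate = parameter_rate + SIDE_INFO_BITS_PER_SECOND
print(bits_per_frame, parameter_rate, total_rate)   # 38 1900 2000
```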
As the bit rate for the dpcm linear prediction vocoder is lowered
below the value of 2000 bits/s, the quality of the synthesis slowly
begins to deviate from that of the unquantized synthesis. Since the
log-area parameters are approximately ordered in terms of their
sensitivity, the most expendable bits are those allotted to the lower-ordered
$g_i$'s. 4 Depending on the speaker and the utterance, the bit rate can be
lowered to between 900 and 1500 bits/s and still allow acceptable
quality synthesis.* Figures 11, 12, and 13 illustrate the above examples
for a bit-rate of 1400 bits/s (3; 3; 1; 4, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1). The
synthetic speech in these examples is slightly degraded from the
unquantized synthesis, but the speech is still readily understood and the
vocal attributes of the speaker are still apparent.

* Acceptable quality speech synthesis is defined as speech containing all the
information content of the original without containing any annoying degradation in
speech quality.

It should be appreciated
that the necessary storage requirements to produce acceptable
quality synthetic speech in this method are nearly | the requirement 
for the pcm transmission of the lpc parameters (see Section I) . 

V. REAL-TIME DPCM TRANSMISSION 

The dpcm scheme developed in Section III suffers from the draw- 
back that the calculation of the dpcm predictors and the quantizer 
step size are delayed until all the lpc parameters are available. For 
real-time speech synthesis, it is desirable that the process of parameter 
transmission be done concurrently with the measurement of the lpc 
parameters. In this section, we discuss several schemes for achieving 
real-time transmission while still retaining almost the performance of 
the optimum dpcm strategy. 

5.1 Average statistical system 

The first means of obtaining a real-time system is based upon the
observation that the optimum dpcm first-order predictor for many of
the lpc parameters is nearly equal to one [$b_1 = 1.0$ in eq. (4)]. Thus,
the optimum linear prediction of the parameter $x_r$ is approximately
$x_{r-1}$. Table III is a comparison of the improvement factors $G_{opt}$
obtained for $b_1 = 1.0$ and $b_1$ set equal to the optimum value. The overall
improvement factors for $b_1 = 1.0$ are not significantly different from
the optimum values, and the delay in calculating the optimum $b_1$ can
be avoided by simply letting $b_1 = 1.0$.
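
A first-order dpcm loop with the predictor fixed at 1.0 can be sketched as follows. The uniform mid-rise quantizer here is only a stand-in for the nonuniform quantizers of Tables I and II, and the pitch-like sample values, step size, and function names are hypothetical.

```python
def quantize(e, step, bits):
    """Uniform mid-rise quantizer with 2**bits levels; returns (code, value)."""
    levels = 2 ** bits
    idx = int(e // step)                              # floor to a slot index
    idx = max(-levels // 2, min(levels // 2 - 1, idx))
    return idx, (idx + 0.5) * step

def dpcm_encode(x, step, bits, b1=1.0, start=0.0):
    """Encode with the predictor driven by the reconstructed signal."""
    codes, recon = [], []
    prev = start                                      # last reconstructed value
    for xr in x:
        pred = b1 * prev
        idx, eq = quantize(xr - pred, step, bits)
        prev = pred + eq                              # feedback around quantizer
        codes.append(idx)
        recon.append(prev)
    return codes, recon

def dpcm_decode(codes, step, b1=1.0, start=0.0):
    """Rebuild the signal from the transmitted codes alone."""
    out, prev = [], start
    for idx in codes:
        prev = b1 * prev + (idx + 0.5) * step
        out.append(prev)
    return out

pitch = [100.0, 102.0, 101.0, 105.0, 110.0, 108.0]   # hypothetical track
codes, recon = dpcm_encode(pitch, step=2.0, bits=3, start=100.0)
decoded = dpcm_decode(codes, step=2.0, start=100.0)
worst = max(abs(a - b) for a, b in zip(pitch, recon))
print(codes, worst)
```

Because the encoder predicts from its own reconstructed values, the decoder tracks it exactly, and while the quantizer is not overloaded the reconstruction error stays within half a step rather than accumulating from frame to frame.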

To design the optimum quantizer, it is necessary to know the
standard deviation of the signal to be quantized. However, our statistical
studies have indicated that the standard deviations of the various
difference signals are quite stable across different utterances and
different speakers. Table IV shows the measured standard deviations
for each difference signal computed with $b_1 = 1.0$. Table IV also
Table III - Comparison of $G_{opt}$ in decibels with $b_1$ set equal
to the optimum value and $b_1 = 1.0$. Sentence A is "Few
thieves are never sent to the jug" and sentence
B is "May we all learn a yellow lion roar."

                      Pitch    Power    g_1     g_2
Sentence A
  b_1 = optimum       23.7     12.2     14.7    12.2
  b_1 = 1.0           20.2     10.1     14.1    11.0
Sentence B
  b_1 = optimum       33.8     19.0     24.0    19.6
  b_1 = 1.0           33.1     18.8     23.9    19.2
Table IV - Measured standard deviations for the
synthesis parameters

                  Updated       No updating
Pitch period      13.01         16.5
Power             27 x 10^8     27 x 10^8
g_1               0.697         0.959
g_2               0.729         0.830
g_3               0.509         0.559
g_4               0.510         0.554
g_5               0.413         0.446
g_6               0.417         0.430
g_7               0.386         0.406
g_8               0.385         0.406
g_9               0.377         0.399
g_10              0.346         0.364
g_11              0.332         0.342
g_12              0.322         0.328



contains the standard deviation for a system in which the prediction 
scheme is not updated for each unvoiced and voiced region. 

Using the standard deviations listed in Table IV and the quantizer 
discussed in Section IV, a robust transmission scheme is achieved. 
For example, the difference signal for the pitch period can be accurately 
quantized for differences as small as two samples or as large as 50 for 
three-bit quantization.* The synthetic speech quality for the average 
statistical system compares quite favorably to the optimum scheme, 
and has the added advantage of real-time implementation. 

5.2 Adaptive system 
5.2.1 Adaptive DPCM prediction 

The dpcm predictors can also be calculated without knowing the
entire sequence of parameters by an adaptive method that is based
upon the technique of "steepest descent." 11 In this scheme, an initial
estimate of the dpcm predictors is determined and then a new set of
predictors is calculated to reduce the prediction error. The
perturbation in the predictors is in a direction opposite the gradient of the
prediction error taken with respect to the dpcm predictor vector. The
resulting perturbation is given by

$$\delta_r(b_j) = B \cdot \mathrm{sgn}(e_r) \cdot \frac{\tilde{x}_{r-j}}{\sum_{k=1}^{n} |\tilde{x}_{r-k}|}, \tag{10}$$

where B is the adaptation rate (typically, B = 0.09), and $\tilde{x}_r$ is the
* If a nonlinear smoothing algorithm (Ref. 10) is applied to the raw pitch measure- 
ments, the variance of the corresponding difference signal is reduced by more than 
i. A two-bit quantization can then be used for pitch without diminishing the quality 
of the synthesis. 



quantized value of the parameter. For the prediction of the (r + 1)th
sample of the parameter, the dpcm predictors are given by

$$b_j^{(r+1)} = b_j^{(r)} + \delta_r(b_j). \tag{11}$$

For a quantizer with $B \ge 2$ bits, it can be shown that the adaptation
scheme will match the changing spectral properties of the speech signal
and result in near-optimum performance. 12 For the two methods given
above, it should be noted that, in addition to the on-line calculation
of the dpcm predictors, it is unnecessary to transmit the predictors.
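
The update of eqs. (10) and (11) can be sketched as a single function. The argument layout (most recent reconstructed value first), the guard against a zero denominator, and the example numbers are assumptions made for illustration; the adaptation rate of 0.09 is the typical value quoted above.

```python
def sign(v):
    """Return +1, 0, or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def adapt_predictors(b, recon_history, e_r, rate=0.09):
    """One steepest-descent step for the dpcm predictors.

    b              current predictors [b_1, ..., b_n]
    recon_history  reconstructed values, recon_history[j-1] = x~_{r-j}
    e_r            difference signal for frame r
    """
    used = recon_history[:len(b)]
    norm = sum(abs(v) for v in used)
    if norm == 0.0:                     # nothing to adapt on (assumed guard)
        return list(b)
    return [bj + rate * sign(e_r) * used[j] / norm
            for j, bj in enumerate(b)]

# Second-order example starting from b_1 = 1.0, b_2 = 0.0.
updated_up = adapt_predictors([1.0, 0.0], [2.0, 2.0], e_r=0.5)
updated_down = adapt_predictors([1.0, 0.0], [2.0, 2.0], e_r=-0.5)
print(updated_up, updated_down)
```

Since the update depends only on reconstructed values and the sign of the difference signal, a decoder that forms the same quantities can run the identical recursion, which is why the predictors themselves need not be transmitted.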

5.2.2 Adaptive quantization 

In the previous section, the quantizer was constructed to take
advantage of the known properties or average statistical properties of
each parameter's difference signal. In this part of the paper, we
introduce an alternate technique for estimating the signal variance. This
method is based upon an adaptive approach developed by Cummiskey,
Jayant, and Flanagan. 13 In their scheme, a simple uniform
quantization of the difference signal is used, but the step size for every new
input is varied by a factor depending on which quantizer slot was
occupied by the previous sample. Numbering the quantizer slots in
the manner shown in Fig. 14, the updated step size $\Delta_{r+1}$ is calculated
from the previous step size $\Delta_r$ by

$$\Delta_{r+1} = \Delta_r \cdot M(|H_r|), \tag{12}$$

where $|H_r| = 1, 2, \ldots, 2^{B-1}$ and the multiplier function M( ) is a
time-invariant function of the quantizer slot number.



Fig. 14 - Numbering of quantizer slots for adaptive quantization.
To adequately match the step size to the signal variance, the
multiplier function must be properly chosen. Table V shows the multiplier
functions found to be experimentally optimum for quantizing speech
waveforms. Using this adaptive scheme (adpcm) and these multipliers,
the quantization of the difference signals can also be efficiently achieved
even when the initial step size is a poor estimate of signal variance.
Table VI is a comparison of the signal-to-noise ratio for the adaptive
scheme with a crude initial estimate of step size and the optimum
quantizer discussed in Section IV. The results in Table VI are an
encouraging demonstration that it is not necessary to know the
statistical structure of the difference signal to efficiently quantize the
signal. In fact, it can be shown that, if the properties of the signal are
nonstationary, the adaptive method is more suitable than the scheme
used in Section IV.
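
The step-size recursion of eq. (12) can be sketched with the multipliers of Table V. The slot numbering follows Fig. 14, with output levels at odd multiples of half the step size; the step-size limits and the function name are assumptions added to keep the example well behaved.

```python
# Step-size multipliers M(1), M(2), ... from Table V, indexed by bits B.
MULTIPLIERS = {
    2: [0.80, 1.60],
    3: [0.90, 0.90, 1.25, 1.75],
    4: [0.90, 0.90, 0.90, 0.90, 1.20, 1.60, 2.00, 2.40],
}

def adaptive_quantize(e, bits, step0, step_min=1e-6, step_max=1e6):
    """Uniform quantizer whose step follows eq. (12).

    Returns the quantized values and the final step size.
    """
    mult = MULTIPLIERS[bits]
    half = len(mult)                    # number of slots per polarity
    step = step0
    out = []
    for v in e:
        slot = min(half, int(abs(v) / step) + 1)      # |H_r| in 1 .. half
        level = (slot - 0.5) * step                   # step/2, 3*step/2, ...
        out.append(level if v >= 0 else -level)
        step = min(step_max, max(step_min, step * mult[slot - 1]))
    return out, step

quantized, final_step = adaptive_quantize([0.2, 3.0], bits=2, step0=1.0)
print(quantized, final_step)
```

An input that lands in an inner slot shrinks the next step and one that lands in an outer slot enlarges it, so a poor initial step size is corrected over a few samples.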

It should be noted that the above scheme does not apply for one-bit
quantization (B = 1). A simple strategy for one-bit quantization has
been developed by Jayant. 14 Let $c_r$ and $c_{r-1}$ denote the values of
successive bits in a one-bit scheme, then

$$\Delta_r = \Delta_{r-1} P^{c_r c_{r-1}}, \tag{13}$$

where P has the typical value P = 1.5. Although this method was
developed for quantizing speech waveforms, it performs quite well
in quantizing the parameter difference signals. A comparison of this
method and the optimum technique is shown in Table VII. Again, the
adaptive scheme works well even with a poor initial estimate of signal
variance.
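
A minimal sketch of the one-bit rule in eq. (13) follows. It assumes that the bits take the values +1 and -1, that the first sample keeps the initial step, and that the quantizer output is the signed step size; these conventions and the function name are assumptions of the example rather than details given in the paper.

```python
def one_bit_adaptive(e, step0, P=1.5):
    """One-bit quantization with the step adapted as in eq. (13)."""
    out = []
    step = step0
    prev_bit = None
    for v in e:
        bit = 1 if v >= 0 else -1
        if prev_bit is not None:
            # Equal successive bits enlarge the step; a sign change shrinks it.
            step = step * P ** (bit * prev_bit)
        out.append(bit * step)
        prev_bit = bit
    return out

levels = one_bit_adaptive([1.0, 1.0, 1.0, -1.0], step0=1.0)
print(levels)
```

A run of identical bits signals that the step is too small and a sign reversal signals that it is too large, so the step size follows the signal magnitude without any side information.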

5.3 Synthesis 

To subjectively evaluate the performance of the adaptive methods
suggested in this section, several speech utterances were synthesized.
The synthesis scheme was again the one described by Atal and
Hanauer, 1 but an adaptive quantizer and a second-order adaptive
prediction dpcm technique were used to transmit the lpc parameters. The
initial estimates of the predictors were $b_1 = 1.0$ and $b_2 = 0.0$. A
second-order analysis was performed because adaptive prediction makes the
$G_{opt}$ function saturate at a larger value than a nonadaptive predictor. 7
The initial estimate of the quantizer step size was set equal to the
standard deviations of the parameters listed in Table IV. For
parameters in which the quantizer uses only one bit, a first-order system
with $b_1 = 1.0$ was used.

Table V - Step size multipliers for B = 2, 3, and 4 (Ref. 7)

          B = 2    B = 3    B = 4
M(1)      0.80     0.90     0.90
M(2)      1.60     0.90     0.90
M(3)               1.25     0.90
M(4)               1.75     0.90
M(5)                        1.20
M(6)                        1.60
M(7)                        2.00
M(8)                        2.40

Table VI - Comparison of the signal-to-noise ratio for the
adaptive quantizer with crude initial estimate of step
size and the optimum gaussian signal uniform
quantizer. The analysis is for the sentence,
"May we all learn a yellow lion roar."

              g_1                  g_2                  g_3
Bits    Adaptive  Optimum    Adaptive  Optimum    Adaptive  Optimum
2       12.6      13.6       18.3      18.4       15.6      16.5
3       18.0      20.4       21.8      21.8       19.2      20.0
4       22.8      21.9       24.9      23.9       24.0      23.1

Employing the bit assignment cited in Section IV, the quality of the 
synthetic speech was only slightly worse than the optimum scheme. 
Figure 15 shows a comparison of one example of the optimum scheme 
and the adaptive method. To achieve the performance of the optimum 
scheme, it has been found necessary to allot approximately one bit 
more per frame to the most sensitive parameters (usually pitch and 
power) . 

VI. ORTHOGONAL LINEAR PREDICTION 

In the dpcm method of transmission, the value of the parameter $x_r$
is predicted from previous values of the given parameter. However,
the lpc parameters have been experimentally determined to be quite
redundant. 16 Thus, the parameter $x_r$ can be predicted not only from
its own past values but also in some sense from the values of the other
lpc parameters. A more efficient method of transmission can then be
obtained by exploiting all the available information about a given
parameter.

Table VII - Comparison of the signal-to-noise ratio for a
one-bit adaptive quantizer and optimum one-bit
gaussian signal uniform quantizer

        g_1                  g_2                  g_3
  Adaptive  Optimum    Adaptive  Optimum    Adaptive  Optimum
  7.3       8.5        8.8       9.9        8.2       9.7

One means of exploiting the redundancy among the lpc parameters
is to generate a set of orthogonal parameters that are linear
combinations of the original set. The new parameters are uniquely (one to one)
related to the lpc parameters and are calculated to be independent of
each other and therefore do not contain any mutual information. If
the original parameters are redundant, only a small subset of the
orthogonal parameters will demonstrate any significant frame-to-frame
variation. The process of obtaining the appropriate orthogonal
parameters is referred to as an eigenvector analysis. 16 The orthogonal
parameters are termed eigenvectors, and each vector's statistical variance
is termed the eigenvalue of the eigenvector.

To determine the eigenvectors, we first calculate the covariance
matrix R of the log-area parameters across the utterance. If we denote
$g_{ij}$ as the ith log-area parameter in the jth frame, then the
elements of R are

    \[ r_{ik} = \frac{1}{N-1} \sum_{j=1}^{N} (g_{ij} - m_i)(g_{kj} - m_k), \]

where

    \[ m_i = \frac{1}{N} \sum_{j=1}^{N} g_{ij} \]

and N is the number of frames in the utterance. Given the covariance
matrix, the set of eigenvalues $\lambda_i$ is found by solving the
characteristic equation

    \[ |R - \lambda I| = 0, \]

where I is the identity matrix and |A| denotes the determinant of the
matrix A. The eigenvectors $\phi_i$ are then found as solutions of the
equation

    \[ \lambda_i \phi_i = R \phi_i. \]
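
The eigenvector analysis above can be sketched numerically. The following is a minimal illustration, not the author's implementation: it assumes NumPy, uses the N-1 normalization for the covariance, and runs on synthetic data (12 parameters driven by 3 hypothetical underlying sources) rather than measured log-area parameters. The function name `eigen_analysis` is made up for this example.

```python
import numpy as np

def eigen_analysis(g):
    """g has shape (p, N): row i holds log-area parameter i over N frames."""
    g = np.asarray(g, dtype=float)
    m = g.mean(axis=1, keepdims=True)        # m_i, the mean of each parameter
    d = g - m                                # deviations g_ij - m_i
    R = d @ d.T / (g.shape[1] - 1)           # covariance matrix, elements r_ik
    lam, phi = np.linalg.eigh(R)             # R is symmetric; eigh returns ascending order
    order = np.argsort(lam)[::-1]            # re-sort so the largest eigenvalue comes first
    return m.ravel(), R, lam[order], phi[:, order]

# Synthetic, deliberately redundant data (an assumption for illustration only).
rng = np.random.default_rng(0)
sources = rng.standard_normal((3, 200))
mixing = rng.standard_normal((12, 3))
g = mixing @ sources + 0.05 * rng.standard_normal((12, 200))

m, R, lam, phi = eigen_analysis(g)
print("fraction of variance in first 3 eigenvectors:", lam[:3].sum() / lam.sum())
```

Each column of `phi` satisfies the eigenvector equation for the matching entry of `lam`, and because the synthetic data has only three underlying sources, nearly all of the variance falls in the first three eigenvectors.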

To illustrate the behavior of the lpc parameters and the
corresponding orthogonal parameters, Table VIII lists the typical
variances (eigenvalues) of each calculated eigenvector parameter for
the four utterances examined. The redundancy in the original log-area
coefficients is reflected in the fact that more than 90 percent of the
total statistical variance is contained in the first five or six
eigenvectors.
Table VIII. Measured eigenvalues for the four sentences analyzed:

A. Few thieves are never sent to the jug.
B. May we all learn a yellow lion roar.
C. It's time we rounded up that herd of Asian cattle.
D. Should we chase those young outlaw cowboys?

    Eigenvector      A       B       C       D
         1         2.62    2.23    1.75    2.75
         2         1.44    0.80    1.29    1.58
         3         0.67    0.54    0.85    0.6
         4         0.44    0.38    0.52    0.36
         5         0.25    0.32    0.31    0.28
         6         0.21    0.24    0.17    0.22
         7         0.10    0.12    0.13    0.16
         8         0.09    0.10    0.10    0.15
         9         0.08    0.08    0.08    0.09
        10         0.06    0.05    0.06    0.06
        11         0.03    0.04    0.04    0.06
        12         0.02    0.01    0.02    0.03


The higher-numbered orthogonal parameters have a relatively small
variance and can therefore be considered essentially constant
throughout the utterance. Thus, the total information in the 12
log-area parameters can be effectively represented in the space of
only the first six eigenvectors.
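
The 90-percent figure can be checked directly from the eigenvalues in Table VIII. The short sketch below only re-does that arithmetic; the dictionary layout and the helper name `cumulative_fraction` are choices made for this example.

```python
# Eigenvalues copied from Table VIII, one list per sentence (A to D).
eigenvalues = {
    "A": [2.62, 1.44, 0.67, 0.44, 0.25, 0.21, 0.10, 0.09, 0.08, 0.06, 0.03, 0.02],
    "B": [2.23, 0.80, 0.54, 0.38, 0.32, 0.24, 0.12, 0.10, 0.08, 0.05, 0.04, 0.01],
    "C": [1.75, 1.29, 0.85, 0.52, 0.31, 0.17, 0.13, 0.10, 0.08, 0.06, 0.04, 0.02],
    "D": [2.75, 1.58, 0.6, 0.36, 0.28, 0.22, 0.16, 0.15, 0.09, 0.06, 0.06, 0.03],
}

def cumulative_fraction(values, k):
    """Fraction of the total variance carried by the first k eigenvectors."""
    return sum(values[:k]) / sum(values)

for name, values in eigenvalues.items():
    five = cumulative_fraction(values, 5)
    six = cumulative_fraction(values, 6)
    print(f"sentence {name}: first five = {five:.3f}, first six = {six:.3f}")
```

For these tabulated values, six eigenvectors carry more than 90 percent of the total variance in every sentence, while five suffice only for sentence A.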

The redundancy in the lpc parameters is not surprising in view of
the fact that the speech signal can be synthesized with only three
formant parameters (F_1, F_2, F_3). Thus, the information contained in
the 12 log-area coefficients is effectively duplicated in the space of
only three formant parameters. The method of orthogonal linear
prediction can be viewed as a constraint technique for squeezing the
original parameters into a smaller but informationally equivalent set
of parameters. The informationally equivalent set is formed by the
most significant orthogonal parameters (significance is measured in
terms of the variance, or eigenvalue, of the orthogonal parameters).

Experimental studies of a variety of speech utterances have shown
that quite acceptable quality synthesis can be generated by
transmitting only the six most significant orthogonal parameters,
pitch, and power. The synthesis is performed by calculating the lpc
parameters from the transmitted orthogonal parameters and a priori
knowledge of the average values of the least significant orthogonal
parameters. For acceptable quality synthesis, only 22 bits/frame are
needed. The allotment of bits was as follows:

Pitch: 3 bits/frame
Power: 3 bits/frame
Unvoiced-voiced: 1 bit/frame
First orthogonal parameter: 4 bits/frame
Second orthogonal parameter: 3 bits/frame
Third orthogonal parameter: 3 bits/frame
Fourth orthogonal parameter: 2 bits/frame
Fifth orthogonal parameter: 2 bits/frame
Sixth orthogonal parameter: 1 bit/frame.
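
The bit budget can be tallied as a quick check. In the sketch below, the frame rate of 50 frames/s is an assumption: it is the value implied by dividing the 1100 bits/s quoted in the text for the synthesis parameters by 22 bits/frame, and it is not stated explicitly in this section. The one-time eigenvector cost follows the breakdown given in the paper's footnote. All variable names are illustrative.

```python
# Per-frame allotment as listed in the text.
bits_per_frame = {
    "pitch": 3,
    "power": 3,
    "unvoiced_voiced": 1,
    "orthogonal_1": 4,
    "orthogonal_2": 3,
    "orthogonal_3": 3,
    "orthogonal_4": 2,
    "orthogonal_5": 2,
    "orthogonal_6": 1,
}
total_bits_per_frame = sum(bits_per_frame.values())

# Assumed frame rate (inferred from 1100 bits/s divided by 22 bits/frame).
FRAMES_PER_SECOND = 50
synthesis_rate = total_bits_per_frame * FRAMES_PER_SECOND

# One-time eigenvector information, following the footnote:
# 4 bits for the average of each of the 6 least significant parameters,
# plus 3 bits for each of the 12 coefficients of each of the 6 transmitted
# orthogonal parameters.
mean_bits = 4 * 6
coefficient_bits = 3 * 12 * 6
side_information_bits = mean_bits + coefficient_bits

print(total_bits_per_frame, synthesis_rate, side_information_bits)
```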

The total transmission storage requirement in this technique is
1100 bits/s for the synthesis parameters, 100 bits/s for the dpcm
information, and an initial one-time investment of 240 bits* for the
necessary eigenvector information. Figures 16 to 18 illustrate the
synthetic speech spectrograms generated by this technique for the
examples previously examined. Depending on the speaker and the
utterance, the bit rate for the synthesis parameters can be reduced to
between 600 and 1000 bits/s and still yield acceptable quality speech.
The low bit rate required for orthogonal linear prediction is quite
attractive, but unfortunately this method involves a complex
eigenvector analysis and a delay in transmission to collect the
statistical data necessary for the calculation of the eigenvectors.
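
The synthesis step described in this section (rebuilding the log-area parameters from the transmitted orthogonal parameters plus the stored averages of the discarded ones) might look like the sketch below. It is an illustration under stated assumptions, not the author's procedure: it assumes NumPy, omits quantization entirely, and uses synthetic data. The names `analyze` and `synthesize` are hypothetical.

```python
import numpy as np

def analyze(g, k):
    """Project the (p, N) parameter tracks onto the eigenvectors; keep the first k."""
    g = np.asarray(g, dtype=float)
    lam, phi = np.linalg.eigh(np.cov(g))       # covariance across frames, rows = parameters
    order = np.argsort(lam)[::-1]              # most significant eigenvector first
    lam, phi = lam[order], phi[:, order]
    theta = phi.T @ g                          # orthogonal parameters, one column per frame
    kept = theta[:k]                           # sent every frame
    dropped_mean = theta[k:].mean(axis=1)      # sent once as a priori knowledge
    return lam, phi, kept, dropped_mean

def synthesize(phi, kept, dropped_mean):
    """Rebuild the parameter tracks, holding the discarded components at their means."""
    n_frames = kept.shape[1]
    filler = np.repeat(dropped_mean[:, None], n_frames, axis=1)
    theta_hat = np.vstack([kept, filler])
    return phi @ theta_hat                     # phi is orthonormal, so this inverts phi.T @ g

# Synthetic redundant data (an assumption for illustration only).
rng = np.random.default_rng(1)
g = rng.standard_normal((12, 4)) @ rng.standard_normal((4, 150))
g += 0.1 * rng.standard_normal((12, 150))

lam, phi, kept, dropped_mean = analyze(g, k=6)
g_hat = synthesize(phi, kept, dropped_mean)
mse = np.sum((g - g_hat) ** 2) / (g.shape[1] - 1)
print("reconstruction error:", mse, "sum of discarded eigenvalues:", lam[6:].sum())
```

In this unquantized setting the total squared reconstruction error, normalized by N-1, equals the sum of the discarded eigenvalues, which is why small trailing eigenvalues justify dropping those parameters.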

VII. SUMMARY AND CONCLUSIONS 

The goal of this paper was the development of a more efficient
method of transmitting the lpc parameters. One proposed method
involved the use of dpcm techniques. In dpcm transmission, we take
advantage of the predictability of the parameter from its previous
values to develop a more effective transmission scheme. Acceptable
quality synthetic speech can be generated with dpcm by allotting
between 1000 and 1500 bits/s. This rate of information transmission
is significantly lower than the bit rates necessary for the
conventional pcm methods.
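
The dpcm idea summarized above, transmitting only the quantized difference between a parameter and its prediction from earlier values, can be illustrated with a generic first-order sketch. This is not the paper's specific coder: the predictor coefficient `a`, the step size, the bit count, and the test signal are all assumptions chosen for the example, and the quantizer is a plain fixed uniform one rather than the adaptive quantizer discussed in the paper.

```python
import math

def quantize(error, step, bits):
    """Uniform mid-rise quantizer: map a prediction error to an integer code."""
    levels = 2 ** bits
    index = math.floor(error / step) + levels // 2
    return max(0, min(levels - 1, index))      # clip to the available codes

def dequantize(index, step, bits):
    """Return the reconstruction level at the centre of the coded interval."""
    levels = 2 ** bits
    return (index - levels // 2 + 0.5) * step

def dpcm_encode(x, a, step, bits):
    codes, previous = [], 0.0
    for value in x:
        prediction = a * previous              # first-order prediction from the past
        index = quantize(value - prediction, step, bits)
        codes.append(index)
        # Track the decoder's reconstruction so both sides stay in step.
        previous = prediction + dequantize(index, step, bits)
    return codes

def dpcm_decode(codes, a, step, bits):
    out, previous = [], 0.0
    for index in codes:
        previous = a * previous + dequantize(index, step, bits)
        out.append(previous)
    return out

# Slowly varying test track (illustrative values only).
x = [math.sin(2 * math.pi * n / 50) for n in range(100)]
A, STEP, BITS = 1.0, 0.05, 3
codes = dpcm_encode(x, A, STEP, BITS)
y = dpcm_decode(codes, A, STEP, BITS)
max_error = max(abs(u - v) for u, v in zip(x, y))
print("max reconstruction error:", max_error)
```

Because the encoder predicts from the decoder's reconstruction rather than from the original values, quantization errors do not accumulate; for this slowly varying signal the error stays within half a quantizer step using only 3 bits per sample.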

To enhance the practical application of the dpcm system, the
methods of adaptive quantization and adaptive prediction were
discussed. These methods allow the on-line calculation of the dpcm
predictors and quantizer step size. To further decrease the storage
requirements of the lpc vocoder, the redundancy of the log-area
parameters was exploited. By transmitting only the most significant
eigenvectors, a considerable saving in bit rate can be achieved.

* Four bits for the average value of each of the six least significant
parameters (24 bits) and three bits for each of the 12 coefficients
required to compute each orthogonal parameter from the log-area
coefficients (36 × 6 = 216 bits).

The techniques discussed in this paper are not limited to the
transmission of the lpc parameters, but can also be used in
conjunction with other vocoder systems. For example, the bit rate of a
formant vocoder 4 can be reduced using a dpcm scheme for transmitting
the necessary information. These transmission techniques have wide
application and can prove very beneficial in a variety of synthesis
schemes.